Complex Background Subtraction by Pursuing
Dynamic Spatio-Temporal Models 

Liang Lin, Yuanlu Xu, Xiaodan Liang, Jianhuang Lai 


Abstract —Although it has been widely discussed in video 
surveillance, background subtraction is still an open problem in complex scenarios, e.g., dynamic backgrounds, illumination variations, and indistinct foreground objects. To address these challenges, we propose an effective background subtraction method that learns and maintains an array of dynamic texture models within spatio-temporal representations. At any location of the scene, we extract a sequence of regular video bricks, i.e. video volumes spanning both the spatial and temporal domains. Background modeling is thus posed as pursuing subspaces within the video bricks while adapting to scene variations. For each sequence of video bricks, we pursue the subspace by employing the ARMA (Auto Regressive Moving Average) model, which jointly characterizes the appearance consistency and temporal coherence of the observations. During online processing, we incrementally update the subspaces to cope with disturbances from foreground objects and scene changes. In the experiments, we validate the proposed method in several complex scenarios and show superior performance over other state-of-the-art background subtraction approaches. Empirical studies of parameter settings and component analysis are presented as well.

Index Terms —Background modeling; visual surveillance; 
spatio-temporal representation. 

I. Introduction

Background subtraction (also referred to as foreground extraction) has been studied extensively for decades, yet it remains an open problem in real surveillance applications due to the following challenges:

• Dynamic backgrounds. A scene environment is not 
always static but sometimes highly dynamic, e.g., rippling 
water, heavy rain and camera jitter. 

• Lighting and illumination variations, particularly with 
sudden changes. 

• Indistinct foreground objects having similar appearances to the surrounding backgrounds.

Fig. 1. Some challenging scenarios for foreground object extraction handled by our approach: (i) a floating bottle on randomly dynamic water (left column), (ii) waving curtains around a person (middle column), and (iii) sudden lighting changes (right column).

In this paper, we address the above-mentioned difficulties by building background models through the online pursuit of spatio-temporal models. Some results generated by our system for the challenging scenarios are shown in Fig. 1. Before presenting the proposed approach, we first review the existing literature.

A. Related Work 

Because background subtraction is pervasive in various applications, there is no unique categorization of the existing works. Here we introduce the related methods according to their representations, to distinguish them from our approach.

The pixel-processing approaches modeled observed scenes as a set of independent pixel processes, and they were widely applied in video surveillance applications. In these methods, each pixel in the scene can be described by different parametric distributions (e.g. Gaussian Mixture Models) to temporally adapt to environment changes. The parametric models, however, were not always compatible with real complex data, as they were defined upon some underlying assumptions. To overcome this problem, some non-parametric estimations were proposed, and they effectively improved the robustness. For example, Barnich et al. presented a sample-based classification model that maintained a fixed number of samples for each pixel and classified a new observation as background when it matched a predefined number of samples. Liao et al. recently employed the kernel density estimation (KDE) technique to capture pixel-level variations. Some distinct scene variations, e.g. illumination changes and shadows, can be explicitly alleviated by introducing extra estimations. Guyon et al. proposed to utilize low-rank matrix decomposition for background modeling, where the foreground objects constituted the correlated sparse outliers. Despite acknowledged successes, this category of approaches may have limitations in complex scenarios, as the pixel-wise representations overlook the spatial correlations between pixels.

The region-based methods built background models by taking advantage of inter-pixel relations, demonstrating impressive results on handling dynamic scenes. A batch of diverse approaches were proposed to model spatial structures of scenes, such as joint distributions of neighboring pixels, block-wise classifiers, structured adjacency graphs [19], auto-regression models [21], random fields [22], and multi-layer models. A number of fast learning algorithms were also discussed to maintain their models online, accounting for environment variations or any structural changes. For example, Monnet et al. trained and updated the region-based model by generative subspace learning. Cheng et al. employed the generalized 1-SVM algorithm for model learning and foreground prediction. In general, methods in this category separated the spatial and temporal information, and their performance was somewhat limited in some highly dynamic scenarios, e.g. heavy rain or sudden illumination changes.

The third category modeled scene backgrounds by exploiting both spatial and temporal information. Mahadevan et al. [24] proposed to separate foreground objects from their surroundings by identifying distinctive video patches, which contained different motions and appearances compared with the majority of the scene. Zhao et al. [25] addressed outdoor night background modeling by performing subspace learning within video patches. Spatio-temporal representations were also extensively discussed in other vision tasks such as action recognition and trajectory parsing. These methods motivated us to build models upon spatio-temporal representations, i.e. video bricks.

In addition, several saliency-based approaches provided alternative solutions based on spatio-temporal saliency estimations [28], [24], [29]. The moving objects can be extracted according to their salient appearances and/or motions against the scene backgrounds. For example, Wixson et al. detected the salient objects according to their consistent moving directions over time. Kim et al. [30] used a discriminant center-surround hypothesis to extract foreground objects from their surroundings.

Along with the above-mentioned background models, a number of reliable image features were utilized to better handle background noise. Examples included the Local Binary Pattern (LBP) features [32], [33], [34] and color texture histograms [35]. The LBP operators described each pixel by the relative graylevels of its neighboring pixels, and their effectiveness has been demonstrated in several vision tasks such as face recognition and object detection [32], [36], [37]. The Center-Symmetric LBP was proposed in [34] to further improve the computational efficiency. Tan and Triggs [33] extended LBP to LTP (Local Ternary Pattern) by thresholding the graylevel differences with a small value, to enhance the effectiveness on flat image regions.

B. Overview 

In this work, we propose to learn and maintain dynamic models within spatio-temporal video patches (i.e. video bricks), accounting for real challenges in surveillance scenarios. The algorithm can process 15 to 20 frames per second at a resolution of 352 × 288 pixels on average. We briefly overview the proposed framework of background modeling in the following aspects.

1. Spatio-temporal representations. We represent the observed scene by video bricks, i.e. video volumes spanning both the spatial and temporal domains, in order to jointly model spatial and temporal information. Specifically, at every location of the scene, a sequence of video bricks is extracted as the observations, within which we can learn and update the background models. Moreover, to compactly encode the video bricks against illumination variations, we design a brick-based descriptor, namely Center Symmetric Spatio-Temporal Local Ternary Pattern (CS-STLTP), which is inspired by the 2D scale invariant local pattern operator. Its effectiveness is also validated in the experiments.

2. Pursuing dynamic subspaces. We treat each sequence of video bricks at a certain location as a consecutive signal, and generate the subspace within these video bricks. The linear dynamic system (i.e. the Auto Regressive Moving Average, ARMA, model [38]) is adopted to characterize the spatio-temporal statistics of the subspace. Specifically, given the observed video bricks, we express them by a data matrix in which each column contains the feature of a video brick. The basis vectors (i.e. eigenvectors) of the matrix can then be estimated analytically, representing the appearance parameters of the subspace, and the parameters of dynamical variations are further computed based on the fixed appearance parameters. It is worth mentioning that our background model jointly captures the information of appearance and motion, as the data (i.e. features of the video bricks) are extracted over both spatial and temporal domains.

3. Maintaining dynamic subspaces online. Given the newly appearing video bricks, moving foreground objects are segmented by estimating the residuals within the related subspaces of the scene, while the background models are maintained simultaneously to account for scene changes. The arising problem is to update the parameters of the subspaces incrementally against disturbance from foreground objects and background noise. The new observation may include noise pixels (i.e. outliers), resulting in degeneration of model updating [25], [20]. Furthermore, one video brick could be partially occluded by foreground objects in our representation, i.e. only some of the pixels in the brick are true positives. To overcome this problem, we present a novel approach to compensate observations (i.e. the observed video bricks) by generating data from the current models. Specifically, we replace the pixels labeled as non-background by the generated pixels to synthesize the new observations. The algorithm for online model updating includes two steps: (i) update appearance parameters using the incremental subspace learning technique, and (ii) update dynamical variation parameters by analytically solving the linear reconstruction. The experiments show that the proposed method effectively improves the robustness during online processing.

Fig. 2. An example of computing the CS-STLTP feature. For one pixel in the video brick, we construct four spatio-temporal planes. The center-symmetric local ternary pattern for each plane is calculated, which compares the intensities in a center-symmetric direction with a contrasting threshold $\tau$. The CS-STLTP feature is obtained by concatenating the vectors of the four planes.

The remainder of this paper is arranged as follows. We first present the model representation in Section II, and then discuss the initial learning, foreground segmentation and online updating mechanism in Section III. The experiments and comparisons are presented in Section IV, and Section V concludes the paper with a summary.


II. Dynamic Spatio-temporal Model 

In this section, we introduce the background of our model, 
and then discuss the video brick representation and our model 
definition, respectively. 


A. Background 

In general, a complex surveillance background may include diverse appearances that sometimes move and change dynamically and randomly over time [39]. There is a branch of works on time-varying texture modeling [40], [41], [42] in computer vision. They often treated the scene as a whole, and pursued a global subspace by utilizing the linear dynamic system (LDS). These models worked well on some natural scenes mostly including a few homogeneous textures, as the LDS characterizes the subspace with a set of linearly combined components. However, under real surveillance challenges, it could be intractable to pursue the global subspace. In this work, we represent the observed scene by an array of small and independent subspaces, each of which is defined by the linear system, so that our model is able to better handle challenging scene variations. Our background model can be viewed as a mixed compositional model consisting of the linear subspaces. In particular, we conduct background subtraction with our model based on the following assumptions.

Assumption 1: The local scene variants (i.e. appearance and motion changing over time) can be captured by the low-dimensional subspace.

Assumption 2: It is feasible to separate foreground moving objects from the scene background by fully exploiting spatio-temporal statistics.


B. Spatio-temporal Video Brick 

Given the surveillance video of one scene, we first decompose it into a batch of small brick-like volumes. We consider that a video brick of small size (e.g., 4 × 4 × 5 pixels) includes relatively simple content, which can thus be generated by a few bases (components). Moreover, the brick volume integrates both spatial and temporal information, so that we can better capture complex appearance and motion variations compared with traditional image patch representations.

We divide each frame $I_i$ $(i = 1, 2, \ldots, n)$ into a set of image patches with width $w$ and height $h$. A number $t$ of patches at the same location across the frames are combined together to form a brick. In this way, we extract a sequence of video bricks $V = \{v_1, v_2, \ldots, v_n\}$ at every location of the scene.
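The decomposition into bricks can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `extract_bricks` is hypothetical, the video is taken to be a grayscale array of shape (frames, height, width), and bricks are assumed not to overlap in space or time, which the text does not specify.

```python
import numpy as np

def extract_bricks(video, w=4, h=4, t=5):
    """Split a grayscale video (n_frames, H, W) into non-overlapping bricks.

    Returns an array of shape (n_bricks, rows, cols, t, h, w): every spatial
    location (row, col) owns a sequence of n_bricks bricks of size t x h x w.
    Frames and pixels that do not fill a whole brick are dropped.
    """
    n_frames, H, W = video.shape
    n_bricks, rows, cols = n_frames // t, H // h, W // w
    v = video[:n_bricks * t, :rows * h, :cols * w]
    # Separate each axis into (block index, offset inside the block).
    v = v.reshape(n_bricks, t, rows, h, cols, w)
    # Reorder the axes to (brick index, row, col, t, h, w).
    return v.transpose(0, 2, 4, 1, 3, 5)

# Toy video: 10 frames of 8 x 12 pixels, bricks of 4 x 4 x 5 pixels.
video = np.arange(10 * 8 * 12, dtype=np.float64).reshape(10, 8, 12)
bricks = extract_bricks(video)
```

Indexing `bricks[:, row, col]` then gives the brick sequence at one location, which is the unit on which each subspace is learned.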

Moreover, we design a novel descriptor to describe the video brick instead of using RGB values. For any video brick $v_i$, we first apply the CS-STLTP operator on each pixel, and pool all the feature values into a histogram. For a pixel $x_c$, we construct a few 2D spatio-temporal planes centered at it, and compute the local ternary pattern (LTP) operator [33] on each plane. The CS-STLTP then encodes $x_c$ by combining the LTP operators of all planes. Note that the way of splitting spatio-temporal planes has little effect on the operator's performance. To simplify the implementation, we make the planes parallel to the Y axis, as shown in Fig. 2.

We index the neighborhood pixels of $x_c$ by $\{0, \ldots, M-1\}$; the operator response of the $j$-th plane can then be calculated as

$$A^{j}(x_c) = \biguplus_{m=0}^{M/2-1} s_\tau\left(p_m, p_{m+M/2}\right), \tag{1}$$

where pixels $m$ and $m + M/2$ are two symmetric neighbors of pixel $x_c$, and $p_m$ and $p_{m+M/2}$ are the graylevels of the two pixels, respectively. The sign $\biguplus$ indicates stretching elements into a vector. The function $s_\tau$ is defined as follows:

$$s_\tau\left(p_m, p_{m+M/2}\right) = \begin{cases} 1, & \text{if } p_m > (1+\tau)\, p_{m+M/2}, \\ -1, & \text{if } p_m < (1-\tau)\, p_{m+M/2}, \\ 0, & \text{otherwise,} \end{cases} \tag{2}$$

where $\tau$ is a constant threshold for the comparing range.

Suppose that we take $M = 8$ neighborhood pixels for computing the operator in each spatio-temporal plane, and the number of planes is 4. The resulting CS-STLTP vector contains $M/2 \times 4 = 16$ bins. Fig. 2 illustrates an example of computing the CS-STLTP operator, where we apply the operator for one pixel on 4 spatio-temporal planes displayed with different colors (e.g., green, blue, purple and orange).
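To make the operator concrete, the following sketch evaluates the center-symmetric ternary comparison on one plane and concatenates four planes into a 16-element vector. The function names, the threshold value `tau=0.2`, the ternary codes 1, -1 and 0, and the assumption that the neighbor list is ordered so that entries `m` and `m + M/2` are center-symmetric are all illustrative choices, not taken from the paper.

```python
def s_tau(p, q, tau=0.2):
    """Ternary comparison of two graylevels within a tolerance range."""
    if p > (1 + tau) * q:
        return 1
    if p < (1 - tau) * q:
        return -1
    return 0

def cs_ltp(neighbors, tau=0.2):
    """Center-symmetric LTP code of one spatio-temporal plane.

    `neighbors` holds the M graylevels around the center pixel, ordered so
    that entries m and m + M/2 lie on opposite sides of the center.
    """
    half = len(neighbors) // 2
    return [s_tau(neighbors[m], neighbors[m + half], tau) for m in range(half)]

def cs_stltp(planes, tau=0.2):
    """Concatenate the per-plane codes into one CS-STLTP vector."""
    return [code for plane in planes for code in cs_ltp(plane, tau)]

# One plane with M = 8 neighbors; the pairs are (0,4), (1,5), (2,6), (3,7).
plane = [100, 100, 50, 130, 100, 70, 100, 100]
code = cs_ltp(plane)            # 4 values, one per symmetric pair
vector = cs_stltp([plane] * 4)  # 4 planes x 4 values = 16 bins
```

Because each comparison is relative to the opposite neighbor's graylevel, scaling all intensities by a common factor leaves the code unchanged, which is the source of the robustness to illumination changes.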

Then we build a histogram $H$ for each video brick by accumulating the CS-STLTP responses of all pixels. This definition was previously proposed by Guo et al. [36]:

$$H(k) = \sum_{x \in v_i} \mathbb{1}\left(A(x), k\right), \quad k \in [0, K], \tag{3}$$

where $A(x)$ denotes the CS-STLTP response at pixel $x$, and $\mathbb{1}(a, b)$ is an indicator function, i.e. $\mathbb{1}(a, b) = 1$ only if $a = b$. To measure the operator response, we transform the binary vector of CS-STLTP into a uniform value that is defined as the number of spatial transitions (bitwise changes), as discussed in [36]. For example, the pattern (i.e. the vector of 16 bins) 0000000000000000 has a value of 0 and 1000000000000000 has a value of 1. In our implementation, we further quantize all possible values into 48 levels. To further improve the capability, we can generate histograms in each color channel and concatenate them together.

The proposed descriptor is computationally efficient and describes the video brick compactly. In addition, by introducing a tolerance range in the LTP operator computation, it is robust to local spatio-temporal noise within that range.

C. Model Definition 

Let $m$ be the descriptor length for each brick, and $V = \{v_1, v_2, \ldots, v_n\}$, $v_i \in \mathbb{R}^m$, be a sequence of video bricks at a certain location of the observed background. We can use a set of bases (components) $C = [C_1, C_2, \ldots, C_d]$ to represent the subspace in which $V$ lies. Each video brick $v_i$ in $V$ can be represented as

$$v_i = \sum_{j=1}^{d} z_{i,j}\, C_j + \omega_i, \tag{4}$$

where $C_j$ is the $j$-th basis ($j$-th column of matrix $C$) of the subspace, $z_{i,j}$ the coefficient for $C_j$, and $\omega_i$ the appearance residual. We use $C$ to represent the appearance consistency of the sequence of video bricks. In some traditional background models based on subspace learning, $z_{i,j}$ can be solved and kept constant, with the underlying assumption that the appearance of the background is stable within the observations. In contrast, we treat $z_{i,j}$ as a variable term that can be further phrased as the time-varying state, accounting for temporally coherent variations (i.e. the motions). For notational simplicity, we drop the subscript $j$, and denote $Z = \{z_1, z_2, \ldots, z_n\}$ for all the bricks. The dynamic model is formulated as

$$z_{i+1} = A z_i + \eta_i, \tag{5}$$

where $\eta_i$ is the state residual, and $A$ is a matrix of $d \times d$ dimensions to model the variations. With this definition, we consider $A$ as representing the temporal coherence among the observations.

Therefore, the problem of pursuing the dynamic subspace is posed as solving for the appearance consistency $C$ and the temporal coherence $A$ within the observations. Since the sequence of states $Z$ is unknown, we shall jointly solve $C, A, Z$ by minimizing an empirical energy function $F_n(C, A, Z)$:

$$\min F_n(C, A, Z) = \sum_{i=1}^{n} \left\| v_i - C z_i \right\|_2^2 + \left\| z_i - A z_{i-1} \right\|_2^2. \tag{6}$$

Here the energy function $F_n(C, A, Z)$ is not completely convex, but we can solve it by fixing either $Z$ or $(C, A)$. Nevertheless, its computation cost is expensive for learning the entire background online. We therefore simplify the dynamic model in Equation (5) into a linear system, following the auto-regressive moving average (ARMA) process. In the literature, Soatto et al. [40] originally associated the output of the ARMA model with dynamic textures, and showed that the first-order ARMA model, driven by white zero-mean Gaussian noise, can capture a wide range of dynamic textures. In our approach, the difficulty of modeling the dynamic variations is alleviated by the brick-based representation, i.e. the observed scene is decomposed into video bricks. Thus, we consider the ARMA process a suitable solution to model the time-varying variables, which can be solved efficiently. Specifically, we introduce a robustness term (i.e. matrix) $B$, which includes a number of bases, and we set $\eta_i = B \varepsilon_i$, where $\varepsilon_i$ denotes the noise.

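We further summarize the proposed dynamic model, where we add the subscript $n$ to the main components, indicating that they are solved within a number $n$ of observations, as

$$\begin{aligned} v_i &= C_n z_i + \omega_i, \\ z_{i+1} &= A_n z_i + B_n \varepsilon_i, \\ \omega_i &\sim N(0, \Sigma_\omega), \quad \varepsilon_i \sim N(0, I_{d_\varepsilon}). \end{aligned} \tag{7}$$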

In this model, $C_n \in \mathbb{R}^{m \times d}$ and $A_n \in \mathbb{R}^{d \times d}$ represent the appearance consistency and temporal coherence, respectively. $B_n \in \mathbb{R}^{d \times d_\varepsilon}$ is the robustness term constraining the evolution of $Z$ over time. $\omega_i \in \mathbb{R}^{m}$ indicates the residual corresponding to observation $v_i$, and $\varepsilon_i \in \mathbb{R}^{d_\varepsilon}$ the noise of state variations. During the subspace learning, $\omega_i$ and $\varepsilon_i$ are assumed to follow zero-mean Gaussian distributions. Given a new brick mapped into the subspace, $\omega_i$ and $\varepsilon_i$ can be used to measure how well the observation fits the subspace, so we utilize them for foreground object detection during online processing.

The proposed model is time-varying, and the parameters $C_n, A_n, B_n$ can be updated incrementally as new observations are processed, in order to adapt our model to scene changes.
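As an illustration of the generative reading of this model, the sketch below draws a sequence of brick descriptors from given parameters. The function name `sample_bricks` is hypothetical, and the appearance noise is assumed to be isotropic with standard deviation `sigma_omega`, a simplification of the general covariance of the appearance residual.

```python
import numpy as np

def sample_bricks(C, A, B, z0, n, sigma_omega=0.0, seed=0):
    """Draw n descriptors from v_i = C z_i + omega_i, z_{i+1} = A z_i + B eps_i.

    C: (m, d) appearance bases, A: (d, d) state transition,
    B: (d, d_eps) noise bases, z0: (d,) initial state.
    """
    rng = np.random.default_rng(seed)
    m, d_eps = C.shape[0], B.shape[1]
    z = np.asarray(z0, dtype=float)
    V = np.empty((m, n))
    for i in range(n):
        # Observation: project the state onto the bases, add appearance noise.
        V[:, i] = C @ z + sigma_omega * rng.standard_normal(m)
        # State transition driven by unit-variance Gaussian noise.
        z = A @ z + B @ rng.standard_normal(d_eps)
    return V

# Toy model: a slowly decaying rotation in a 2-D subspace of a 4-D descriptor.
C = np.eye(4)[:, :2]
theta = 0.3
A = 0.95 * np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta), np.cos(theta)]])
B = np.zeros((2, 1))  # no state noise, so the trajectory is deterministic
V = sample_bricks(C, A, B, z0=[1.0, 0.0], n=5)
```

With `B` set to zero and no appearance noise, column `i` of `V` equals `C @ A^i @ z0`, which is also how a noise-free brick can be predicted from the previous state.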

III. Learning Algorithm 

In this section, we discuss the learning for spatio-temporal 
background models, including initial subspace generation and 
online maintenance. The initial learning is performed at the 
beginning of system deployment, when only a few foreground 
objects move in the scene. Afterwards, the system switches to 
the mode of online maintenance. 

A. Initial Model Learning 

In the initial stage, the model defined in Equation (7) degenerates into a non-dynamic linear system, as the $n$ observations are extracted and fixed. Given a brick sequence $V = \{v_1, v_2, \ldots, v_n\}$, we present an algorithm to identify the model parameters $C_n, A_n, B_n$, following the sub-optimal solution proposed in [40].

To guarantee that Equation (7) has a unique and canonical solution, we postulate

$$n > d, \quad \mathrm{Rank}(C_n) = d, \quad C_n^\top C_n = I_d, \tag{8}$$

where $I_d$ is the identity matrix of dimension $d \times d$. The appearance consistency term $C_n$ can be estimated as

$$C_n = \arg\min_{C_n} \left\| W_n - C_n \left[ z_1 \; z_2 \; \cdots \; z_n \right] \right\|, \tag{9}$$

where $W_n$ is the data matrix composed of the observed video bricks $[v_1, v_2, \cdots, v_n]$. Equation (9) satisfies the full rank approximation property and can thus be solved by the singular value decomposition (SVD). We have

$$W_n = U \Sigma Q^\top, \quad U^\top U = I, \quad Q^\top Q = I, \tag{10}$$

where $Q$ is the unitary matrix, $U$ includes the eigenvectors, and $\Sigma$ is the diagonal matrix of the singular values. Thus, $C_n$ is taken as the first $d$ components of $U$, and the state matrix $[z_1 \; z_2 \; \cdots \; z_n]$ is the product of the $d \times d$ sub-matrix of $\Sigma$ and the transpose of the first $d$ columns of $Q$.

The temporal coherence term $A_n$ is calculated by solving the following linear problem:

$$A_n = \arg\min_{A_n} \left\| \left[ z_2 \; z_3 \; \cdots \; z_n \right] - A_n \left[ z_1 \; z_2 \; \cdots \; z_{n-1} \right] \right\|. \tag{11}$$

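The statistical robustness term $B_n$ is estimated from the reconstruction error $E$,

$$E = \left[ z_2 \; z_3 \; \cdots \; z_n \right] - A_n \left[ z_1 \; z_2 \; \cdots \; z_{n-1} \right] \doteq B_n \left[ \varepsilon_1 \; \varepsilon_2 \; \cdots \; \varepsilon_{n-1} \right], \tag{12}$$

where $B_n \cong \frac{1}{\sqrt{n-1}} E$. Since the rank of $A_n$ is $d$ and $d \ll n$, the rank of the input-to-state noise, $d_\varepsilon$, is assumed to be much smaller than $d$. That is, the dimension of $E$ can be further reduced by SVD: $E = U_\varepsilon \Sigma_\varepsilon Q_\varepsilon^\top$, and we have

$$B_n = \frac{1}{\sqrt{n-1}} \left[ U_\varepsilon^{1} \; \cdots \; U_\varepsilon^{d_\varepsilon} \right] \mathrm{diag}\left( \Sigma_\varepsilon^{1}, \ldots, \Sigma_\varepsilon^{d_\varepsilon} \right), \tag{13}$$

where $U_\varepsilon^{j}$ is the $j$-th column of $U_\varepsilon$ and $\Sigma_\varepsilon^{j}$ the $j$-th singular value in $\Sigma_\varepsilon$.

The values of $d$ and $d_\varepsilon$ essentially imply the complexity of the subspace from the aspects of appearance consistency and temporal coherence, respectively. For example, video bricks containing static content can be well described with a function of low dimensions, while highly dynamic video bricks (e.g., from an active fountain) require more bases to generate. In real surveillance scenarios, it is not practical to pre-determine the complexity of scene environments. Hence, in the proposed method, we adaptively determine $d$ and $d_\varepsilon$ by thresholding the eigenvalues in $\Sigma$ and $\Sigma_\varepsilon$, respectively:

$$d^{*} = \arg\max_{d} \; \Sigma^{d} > T_d, \qquad d_\varepsilon^{*} = \arg\max_{d_\varepsilon} \; \Sigma_\varepsilon^{d_\varepsilon} > T_{d_\varepsilon}, \tag{14}$$

where $\Sigma^{d}$ indicates the $d$-th eigenvalue in $\Sigma$ and $\Sigma_\varepsilon^{d_\varepsilon}$ the $d_\varepsilon$-th eigenvalue in $\Sigma_\varepsilon$.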
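The initial learning procedure can be sketched with standard NumPy routines. This is a minimal illustration rather than the authors' implementation: the names `learn_initial_model` and `count_above` are hypothetical, the dimensions `d` and `d_eps` are passed in directly (the helper shows how a threshold could select them), and the scaling of the noise bases by the square root of the number of residual columns is an assumption based on the unit-variance noise model.

```python
import numpy as np

def count_above(singular_values, threshold):
    """Number of leading singular values that exceed a threshold."""
    return int(np.sum(np.asarray(singular_values) > threshold))

def learn_initial_model(W, d, d_eps):
    """Estimate C (m, d), A (d, d), B (d, d_eps) and states Z (d, n) from W (m, n)."""
    n = W.shape[1]
    # Appearance bases and states from the SVD of the data matrix.
    U, s, Qt = np.linalg.svd(W, full_matrices=False)
    C = U[:, :d]
    Z = s[:d, None] * Qt[:d, :]
    # Temporal coherence: least-squares fit of z_{i+1} = A z_i.
    past, future = Z[:, :-1], Z[:, 1:]
    A = future @ np.linalg.pinv(past)
    # Noise bases from the SVD of the one-step prediction error.
    E = future - A @ past
    U_e, s_e, _ = np.linalg.svd(E, full_matrices=False)
    B = U_e[:, :d_eps] * s_e[:d_eps] / np.sqrt(n - 1)
    return C, A, B, Z

# Toy data: a noise-free trajectory in a 2-D subspace of a 6-D descriptor.
rng = np.random.default_rng(0)
C_true, _ = np.linalg.qr(rng.standard_normal((6, 2)))
A_true = 0.95 * np.array([[np.cos(0.3), -np.sin(0.3)],
                          [np.sin(0.3), np.cos(0.3)]])
states = [np.array([1.0, 0.0])]
for _ in range(29):
    states.append(A_true @ states[-1])
W = C_true @ np.column_stack(states)
C, A, B, Z = learn_initial_model(W, d=2, d_eps=1)
```

The recovered bases are only determined up to an invertible change of coordinates in the state space, so `C` and `A` need not equal `C_true` and `A_true`, although `C @ Z` reproduces `W` and `A` predicts each state from the previous one.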




B. Online Model Maintenance 

We now discuss the online processing with our model, which segments foreground moving objects and keeps the model updated.

(I) Foreground segmentation. Given one newly appearing video brick $v_{n+1}$, we can determine whether pixels in $v_{n+1}$ belong to the background or not by thresholding their appearance residual and state residual. We first estimate the state of $v_{n+1}$ with the existing $C_n$,

$$z_{n+1} = C_n^\top v_{n+1}, \tag{15}$$

and further the appearance residual of $v_{n+1}$,

$$\omega_{n+1} = v_{n+1} - C_n z_{n+1}. \tag{16}$$

As the state $z_n$ and the temporal coherence $A_n$ have been solved, we can then estimate the state residual $\varepsilon_n$ according to Equation (7),

$$B_n \varepsilon_n = z_{n+1} - A_n z_n \;\Rightarrow\; \varepsilon_n = \mathrm{pinv}(B_n) \left( z_{n+1} - A_n z_n \right), \tag{17}$$

where pinv denotes the operator of pseudo-inverse.

With the state residual $\varepsilon_n$ and the appearance residual $\omega_{n+1}$ for the new video brick $v_{n+1}$, we apply the following criteria for foreground segmentation, in which two thresholds are introduced.

1) $v_{n+1}$ is classified as background only if all dimensions of $\varepsilon_n$ are less than a threshold $T_\varepsilon$.
2) If $v_{n+1}$ has been labeled as non-background, perform the pixel-wise segmentation by comparing $\omega_{n+1}$ with a threshold $T_\omega$: the pixel is segmented as foreground if its corresponding dimension in $\omega_{n+1}$ is greater than $T_\omega$.
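The two-stage test can be sketched as follows. The function name `segment_brick` is hypothetical, and two simplifications are assumptions of this sketch: residuals are compared by magnitude, and each descriptor dimension is treated as one pixel, which holds for raw-intensity bricks but not directly for histogram descriptors.

```python
import numpy as np

def segment_brick(v_new, z_prev, C, A, B, T_eps, T_omega):
    """Classify a new brick and return (is_background, foreground_mask, z_new)."""
    z_new = C.T @ v_new                              # state of the new brick
    omega = v_new - C @ z_new                        # appearance residual
    eps = np.linalg.pinv(B) @ (z_new - A @ z_prev)   # state residual
    # Stage 1: the whole brick is background if every state residual is small.
    is_background = bool(np.all(np.abs(eps) < T_eps))
    if is_background:
        mask = np.zeros(v_new.shape, dtype=bool)
    else:
        # Stage 2: per-dimension test on the appearance residual.
        mask = np.abs(omega) > T_omega
    return is_background, mask, z_new

# Toy model: 2-D subspace of a 4-D brick with static dynamics.
C = np.eye(4)[:, :2]
A = np.eye(2)
B = np.eye(2)
z_prev = np.array([1.0, 1.0])

bg = segment_brick(np.array([1.0, 1.0, 0.0, 0.0]), z_prev, C, A, B, 1.0, 1.0)
fg = segment_brick(np.array([5.0, 1.0, 0.0, 3.0]), z_prev, C, A, B, 1.0, 1.0)
```

In the second call the state jumps away from its prediction, so the brick is flagged, and only the last dimension, which lies outside the subspace, is marked as foreground.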
(II) Model updating. During online processing, the key problem for model updating is to deal with foreground disturbance, i.e. to avoid absorbing pixels from foreground objects or noise.

In this work, we develop an effective approach to update the model with synthesized data. We first generate a video brick from the current model, namely the noise-free brick $\tilde{v}_{n+1}$, as

$$\tilde{z}_{n+1} = A_n z_n, \qquad \tilde{v}_{n+1} = C_n \tilde{z}_{n+1}. \tag{18}$$

Then we extract pixels from $\tilde{v}_{n+1}$ to compensate occluded (i.e. foreground) pixels in the newly appearing brick. Concretely, the pixels labeled as non-background are replaced by the pixels from the noise-free video brick at the same place. We can thus obtain a synthesized video brick $\hat{v}_{n+1}$ for model updating.

Given the brick $\hat{v}_{n+1}$, the data matrix $W_n$ composed of observed video bricks is extended to $W_{n+1}$. Then we update the model parameters according to Equation (7).
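The compensation step amounts to a masked replacement. The sketch below assumes that the noise-free brick is obtained by propagating the previous state through the model without noise; the function name `synthesize_brick` and the boolean mask layout are illustrative.

```python
import numpy as np

def synthesize_brick(v_new, foreground_mask, z_prev, C, A):
    """Replace foreground entries of v_new with the model's own prediction."""
    v_tilde = C @ (A @ z_prev)                  # noise-free brick from the model
    v_hat = np.where(foreground_mask, v_tilde, v_new)
    return v_hat, v_tilde

C = np.eye(4)[:, :2]
A = np.eye(2)
z_prev = np.array([1.0, 1.0])
v_new = np.array([5.0, 1.0, 0.0, 3.0])
mask = np.array([False, False, False, True])   # last entry labeled foreground
v_hat, v_tilde = synthesize_brick(v_new, mask, z_prev, C, A)
```

Only the masked entry is overwritten, so background pixels inside a partially occluded brick still contribute their real observations to the update.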

Our algorithm of model updating includes two steps: (i) update the parameters for appearance consistency $C_{n+1}$ by employing the incremental subspace learning technique, and (ii) update the parameters of state variations $A_{n+1}, B_{n+1}$.

(i) Step 1. For the $d$-dimensional subspace, with eigenvectors $C_n$ and eigenvalues $\Lambda_n$, its covariance matrix $\mathrm{Cov}_n$ can be approximated as

$$\mathrm{Cov}_n \approx \sum_{j=1}^{d} \lambda_{n,j}\, c_{n,j}\, c_{n,j}^\top = C_n \Lambda_n C_n^\top, \tag{19}$$

where $c_{n,j}$ and $\lambda_{n,j}$ denote the $j$-th eigenvector and eigenvalue, respectively. With the newly synthesized data $\hat{v}_{n+1}$, the updated covariance matrix $\mathrm{Cov}_{n+1}$ is formulated as

$$\begin{aligned} \mathrm{Cov}_{n+1} &= (1-\alpha)\, \mathrm{Cov}_n + \alpha\, \hat{v}_{n+1} \hat{v}_{n+1}^\top \\ &\approx (1-\alpha)\, C_n \Lambda_n C_n^\top + \alpha\, \hat{v}_{n+1} \hat{v}_{n+1}^\top \\ &= \sum_{i=1}^{d} (1-\alpha)\, \lambda_{n,i}\, c_{n,i}\, c_{n,i}^\top + \alpha\, \hat{v}_{n+1} \hat{v}_{n+1}^\top, \end{aligned} \tag{20}$$

where $\alpha$ denotes the learning rate. The covariance matrix can be further re-formulated to simplify computation, as

$$\mathrm{Cov}_{n+1} = Y_{n+1} Y_{n+1}^\top, \tag{21}$$

where $Y_{n+1} = [y_{n+1,1} \; y_{n+1,2} \; \cdots \; y_{n+1,d+1}]$ and each column $y_{n+1,j}$ in $Y_{n+1}$ is defined as

$$y_{n+1,j} = \begin{cases} \sqrt{(1-\alpha)\, \lambda_{n,j}}\; c_{n,j}, & \text{if } 1 \le j \le d, \\ \sqrt{\alpha}\; \hat{v}_{n+1}, & \text{if } j = d+1. \end{cases} \tag{22}$$

To reduce the computation cost, we can estimate $C_{n+1}$ by a smaller matrix $Y_{n+1}^\top Y_{n+1}$ instead of the original large matrix $\mathrm{Cov}_{n+1}$:

$$\left( Y_{n+1}^\top Y_{n+1} \right) e_{n+1,j} = \lambda_{n+1,j}\, e_{n+1,j}, \quad j = 1, 2, \ldots, d+1, \tag{23}$$

where $e_{n+1,j}$ and $\lambda_{n+1,j}$ are the $j$-th eigenvector and eigenvalue of matrix $Y_{n+1}^\top Y_{n+1}$, respectively. Let $c_{n+1,j} = Y_{n+1} e_{n+1,j}$, and we re-write Equation (23) as

$$\mathrm{Cov}_{n+1}\, c_{n+1,j} = \lambda_{n+1,j}\, c_{n+1,j}, \quad j = 1, 2, \ldots, d+1. \tag{24}$$

We thus obtain the updated eigenvectors $C_{n+1}$ and the corresponding eigenvalues $\Lambda_{n+1}$ of the new covariance matrix $\mathrm{Cov}_{n+1}$. Note that the dimension of the subspace is automatically increased along with the newly added data $\hat{v}_{n+1}$. To guarantee that the appearance parameters remain stable, we keep the principal (i.e. top $d$) eigenvectors and eigenvalues while discarding the least significant components.

The above incremental subspace learning algorithm has been widely applied in several vision tasks such as face recognition and image segmentation, and also for background modeling in [25]. However, the noisy observations caused by moving objects or scene variations often disturb the subspace maintenance, e.g. the eigenvectors could change dramatically during the processing. Many efforts [47], [48] have been dedicated to improving the robustness of incremental learning by using statistical analysis. Several discriminative learning algorithms [49] were also employed to train background classifiers that can be incrementally updated. In this work, we utilize a version of Robust Incremental PCA (RIPCA) [50] to cope with the outliers in $\hat{v}_{n+1}$. Note that $\hat{v}_{n+1}$ consists of pixels either from the generated data $\tilde{v}_{n+1}$ or from real videos, where outliers may exist in some dimensions.

In the traditional PCA learning, the solution is derived by minimizing a least-squares reconstruction error,

$$\min \left\| r_{n+1} \right\|^2 = \left\| C_n C_n^\top \hat{v}_{n+1} - \hat{v}_{n+1} \right\|^2. \tag{25}$$

Following [50], we impose a robustness function $w(t)$, governed by a parameter $\rho$, over each dimension of $r_{n+1}$, and the target can be re-defined as

$$\min \sum_{k} w\left( r_{n+1}^{k} \right) \left( r_{n+1}^{k} \right)^2, \tag{26}$$

where the superscript $k$ indicates the $k$-th dimension. The parameter $\rho$ in the robustness function is estimated by

$$\rho = \left[ \rho^{1}, \rho^{2}, \ldots, \rho^{|v_{n+1}|} \right]^\top, \qquad \rho^{k} = \max_{i=1,\ldots,d} \beta \sqrt{\lambda_{n,i}}\, \left| c_{n,i}^{k} \right|, \quad k = 1, 2, \ldots, |v_{n+1}|, \tag{27}$$

where $\beta$ is a fixed coefficient. The $k$-th dimension of $\rho$ is proportional to the maximal projection of the current eigenvectors on the $k$-th dimension (i.e. the projections are weighted by their corresponding eigenvalues). Note that $w(r_{n+1}^{k})$ is a function of the residual error, which should be calculated for each vector dimension. The computation cost for $w(r_{n+1}^{k})$ can be neglected in the analytical solution.

Accordingly, we can update the observation $\hat{v}_{n+1}$ over each dimension by computing

$$\bar{v}_{n+1}^{k} = \sqrt{w\left( r_{n+1}^{k} \right)}\; \hat{v}_{n+1}^{k}. \tag{28}$$

That is, we treat $\bar{v}_{n+1}$ as the new observation during the procedure of incremental learning.

Fig. 3. An example to demonstrate the robustness of model maintenance. In the scenario of dynamic water surfaces, we visualize the original and predicted intensities for a fixed position (denoted by the red star) with the blue and red curves, respectively. With our updating scheme, when the position is occluded by a foreground object from frame 551 to frame 632, the predicted intensities are not disturbed by the foreground, i.e. the model remains stable.
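Step 1 can be sketched as an eigen-decomposition of the small Gram matrix followed by a mapping back to the descriptor space. The function names below are hypothetical. The robustness weight is an assumption of this sketch: a Cauchy-type function `1 / (1 + (t / rho)^2)` is used as a stand-in, since the exact form of the robustness function follows the cited RIPCA work and is not spelled out here; the values of `alpha` and `beta` are likewise illustrative.

```python
import numpy as np

def robust_reweight(v_hat, C, lam, beta=2.0):
    """Down-weight dimensions of v_hat with a large reconstruction residual."""
    r = C @ (C.T @ v_hat) - v_hat                      # per-dimension residual
    rho = beta * np.max(np.sqrt(lam) * np.abs(C), axis=1)
    rho = np.maximum(rho, 1e-12)                       # avoid division by zero
    w = 1.0 / (1.0 + (r / rho) ** 2)                   # ASSUMED Cauchy-type weight
    return np.sqrt(w) * v_hat

def incremental_update(C, lam, v, alpha=0.05):
    """Update d eigenvectors C (m, d) and eigenvalues lam (d,) with a new sample v."""
    d = C.shape[1]
    # Columns of Y reproduce the blended covariance as Y @ Y.T.
    Y = np.column_stack([C * np.sqrt((1 - alpha) * lam), np.sqrt(alpha) * v])
    # Eigen-decomposition of the small (d+1) x (d+1) matrix Y.T @ Y.
    evals, evecs = np.linalg.eigh(Y.T @ Y)
    top = np.argsort(evals)[::-1][:d]                  # keep the top d components
    evals, evecs = evals[top], evecs[:, top]
    # Map back to the descriptor space and normalize each eigenvector.
    C_new = Y @ evecs
    C_new /= np.maximum(np.linalg.norm(C_new, axis=0), 1e-12)
    return C_new, evals

C = np.eye(5)[:, :2]
lam = np.array([4.0, 1.0])
alpha = 0.1
v = np.array([1.0, 2.0, 0.5, 0.0, 0.0])
C_new, lam_new = incremental_update(C, lam, v, alpha)
cov = (1 - alpha) * C @ np.diag(lam) @ C.T + alpha * np.outer(v, v)

inlier = robust_reweight(np.array([1.0, 2.0, 0.0, 0.0, 0.0]), C, lam)
outlier = robust_reweight(np.array([1.0, 2.0, 10.0, 0.0, 0.0]), C, lam)
```

Working with the small matrix keeps the cost tied to the subspace dimension rather than the descriptor length, and the re-weighting leaves a sample that lies in the subspace untouched while suppressing a dimension the current bases cannot explain.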

(ii) Step 2. With the fixed $C_{n+1}$, we then update the parameters of state variations $A_{n+1}, B_{n+1}$. We first estimate the latest state $z_{n+1}$ based on the updated $C_{n+1}$ as

$$z_{n+1} = C_{n+1}^\top \hat{v}_{n+1}, \tag{29}$$

where $\hat{v}_{n+1}$ denotes the synthesized video brick. $A_{n+1}$ can be further calculated by re-solving the linear problem over a fixed number of the latest observed states,

$$A_{n+1} \left[ z_{n-l+1} \; \cdots \; z_n \right] = \left[ z_{n-l+2} \; \cdots \; z_{n+1} \right], \tag{30}$$

where $l$ indicates the number of latest observed states, i.e. the span of observations. Similarly, we update $B_{n+1}$ by computing the new reconstruction error $E = [z_{n-l+2} \; \cdots \; z_{n+1}] - A_{n+1} [z_{n-l+1} \; \cdots \; z_n]$.
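Step 2 re-solves the same least-squares problem as in the initial learning, restricted to a sliding window of recent states. In the sketch below the function name `update_dynamics` is hypothetical, and scaling the noise bases by the square root of the window length is an assumption carried over from the initial learning stage.

```python
import numpy as np

def update_dynamics(Z_recent, d_eps):
    """Re-estimate A and B from the latest states.

    Z_recent: (d, l + 1) matrix whose columns are the l + 1 most recent states,
    ending with the newest one.
    """
    past, future = Z_recent[:, :-1], Z_recent[:, 1:]
    A = future @ np.linalg.pinv(past)        # least-squares solution
    E = future - A @ past                    # new reconstruction error
    U, s, _ = np.linalg.svd(E, full_matrices=False)
    B = U[:, :d_eps] * s[:d_eps] / np.sqrt(past.shape[1])
    return A, B

# Toy check: states generated by a known transition matrix without noise.
A_true = 0.95 * np.array([[np.cos(0.3), -np.sin(0.3)],
                          [np.sin(0.3), np.cos(0.3)]])
cols = [np.array([1.0, 0.0])]
for _ in range(5):
    cols.append(A_true @ cols[-1])
A_new, B_new = update_dynamics(np.column_stack(cols), d_eps=1)
```

A short window lets the dynamics track scene changes quickly, at the price of a noisier estimate when few states are available.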

We present an empirical study in Fig. 3 to demonstrate the effectiveness of this updating method. The video for background modeling includes dynamic water surfaces. Here we visualize the original and predicted intensities for a fixed position (denoted by the red star), with the blue and red curves, respectively. We can observe that the model remains stable against foreground occlusion.

Time complexity analysis. We mainly employ SVD and lin¬ 
ear programming in the initial learning. The time complexity 
of SVD is O(n^) and the learning time of linear programming 
is O(n^). For a certain location, the time complexity of initial 
learning is O(n^) + (9(n^) = 0 {n^) for each subspace, where 
n denotes the number of video bricks for model learning. As 
for online learning, incremental subspace learning and linear 


Algorithm 1: The sketch of the proposed algorithm.

Input: Video brick sequence V = {v_1, v_2, ..., v_n} for every location of the scene.
Output: Maintained background models and foreground regions.

forall the locations of the scene do
    Given the observed video bricks V, extract the CS-STLTP descriptor;
    Initialize the subspace by estimating $C_n$, $A_n$, $B_n$ using Equations (8)-(14);
    for the newly appearing video brick $v_{n+1}$ do
        (1) Extract the CS-STLTP descriptor for $v_{n+1}$;
        (2) Calculate its state residual $\epsilon_{n+1}$ and appearance residual $\omega_{n+1}$ by Equations (16) and (17);
        (3) For each pixel of $v_{n+1}$, classify it into foreground or background by thresholding the two residuals $\epsilon_{n+1}$, $\omega_{n+1}$;
        (4) Generate the noise-free brick from the current model by Equation (18);
        (5) Synthesize the video brick $\hat{v}_{n+1}$ for model updating;
        (6) Update $\hat{v}_{n+1}$ into $\tilde{v}_{n+1}$ by introducing a robustness function;
        (7) Update the new appearance parameter $C_{n+1}$ by calculating the covariance matrix $Cov_{n+1}$ with the learning rate a;
        (8) Update the state variation parameters $A_{n+1}$, $B_{n+1}$;
    end
end


programming are utilized. Given a d-dimensional subspace, the time complexity for component updating (i.e. step 1 of the model maintenance) is $O(dn^2)$. Thus, the total time complexity for online learning is $O(dn^2) + O(l^2)$, where $l$ is the number of states used to solve the linear problem.

We summarize the algorithm sketch of our framework in Algorithm 1.


IV. Experiments 

In this section, we first introduce the datasets used in the experiments and the parameter settings, then present the experimental results and comparisons. Finally, we discuss the system components.

A. Datasets and settings 

We collect a number of challenging videos to validate our approach, which are publicly available or from real surveillance systems. Two of them (AirportHall and TrainStation), from the PETS database (downloaded from http://www.cvg.rdg.ac.uk/slides/pets.html), include crowded pedestrians and moving cast shadows; five highly dynamic scenes (downloaded from http://perception.i2r.a-star.edu.sg) include waving curtain, active fountain, swaying trees, and water surface; the others contain extremely difficult cases such as heavy rain and sudden and gradual light changes. Most of the videos include thousands of frames, and some of the frames are manually annotated as the ground truth provided by the original databases.

Our algorithm has been adopted in a real video surveillance system and achieves satisfactory performance. The system is capable of processing 15 to 20 frames per second at a resolution of 352 × 288 pixels. The hardware is a desktop computer with an Intel i7 2600 (3.4 GHz) CPU and 8 GB of RAM.

All parameters are fixed in the experiments, including the contrast threshold for the CS-STLTP descriptor r = 0.2, the dimension thresholds for the ARMA model (both set to 0.5), the span of observations for model updating l = 60, and the size of bricks 4 × 4 × 5. For foreground segmentation, the threshold of the appearance residual is 3 and the update threshold is 3; for the RGB feature, these are 5 and 4, respectively. In the online model maintenance, the coefficient p = 2.3849 and the learning rate a = 0.05 for RIPCA.
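To make the brick size concrete, the following sketch cuts a grayscale video into 4 × 4 × 5 bricks and flattens each one into a vector. It is only an illustration under assumptions: the bricks are taken to be non-overlapping, raw intensities stand in for the CS-STLTP descriptor, and the function name `extract_bricks` is hypothetical.

```python
import numpy as np

def extract_bricks(video, bh=4, bw=4, bt=5):
    """Split a video of shape (T, H, W) into non-overlapping bricks.

    Returns an array of shape (T // bt, H // bh, W // bw, bt * bh * bw):
    for every spatial location, a sequence of bricks flattened into vectors.
    Frames, rows, or columns that do not fill a whole brick are dropped.
    """
    T, H, W = video.shape
    nt, nh, nw = T // bt, H // bh, W // bw
    v = video[:nt * bt, :nh * bh, :nw * bw]
    v = v.reshape(nt, bt, nh, bh, nw, bw)
    v = v.transpose(0, 2, 4, 1, 3, 5)   # -> (nt, nh, nw, bt, bh, bw)
    return v.reshape(nt, nh, nw, bt * bh * bw)

# 50 initialization frames at the 352 x 288 resolution quoted above.
video = np.zeros((50, 288, 352), dtype=np.float32)
bricks = extract_bricks(video)
print(bricks.shape)  # (10, 72, 88, 80)
```

Under these assumptions, the 50 initialization frames yield 10 bricks at each of 72 × 88 = 6336 locations, each brick being an 80-dimensional vector.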

In the experiments, we use the first 50 frames of each testing video to initialize our system (i.e. to perform the initial learning), and keep the model updated over the rest of the sequence. In addition, we apply a standard post-processing step to eliminate areas of fewer than 20 pixels. All other competing approaches are executed with the same settings as our approach.
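The post-processing step can be sketched as a connected-component filter on the binary foreground mask. The text does not specify the connectivity, so 4-connectivity is an assumption here, and the function name `remove_small_regions` is hypothetical.

```python
from collections import deque

import numpy as np

def remove_small_regions(mask, min_pixels=20):
    """Clear 4-connected foreground regions with fewer than min_pixels pixels."""
    mask = np.asarray(mask, dtype=bool)
    H, W = mask.shape
    seen = np.zeros_like(mask)
    out = mask.copy()
    for y in range(H):
        for x in range(W):
            if not mask[y, x] or seen[y, x]:
                continue
            # Breadth-first search collects one connected region.
            region = [(y, x)]
            queue = deque(region)
            seen[y, x] = True
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < H and 0 <= nx < W and mask[ny, nx] and not seen[ny, nx]:
                        seen[ny, nx] = True
                        region.append((ny, nx))
                        queue.append((ny, nx))
            if len(region) < min_pixels:
                ys, xs = zip(*region)
                out[list(ys), list(xs)] = False
    return out
```

A pure-Python search is slow on full frames; it is used here only to keep the sketch dependency-free.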

We utilize the F-score as the benchmark metric, which measures the segmentation accuracy by considering both the recall and the precision. The F-score is defined as

$$F = \frac{2TP}{2TP + FP + FN}, \qquad (31)$$

where TP is true positives (foreground objects), FN false negatives (false background pixels), and FP false positives (false foreground pixels).
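As a worked example of Eq. (31), the F-score of a predicted mask against a ground-truth mask can be computed as below. The function name `f_score` and the convention of returning 0 when both masks are empty are assumptions made for this sketch.

```python
import numpy as np

def f_score(pred, gt):
    """F-score 2TP / (2TP + FP + FN) for two binary foreground masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    tp = np.count_nonzero(pred & gt)    # foreground correctly detected
    fp = np.count_nonzero(pred & ~gt)   # background labeled as foreground
    fn = np.count_nonzero(~pred & gt)   # foreground labeled as background
    denom = 2 * tp + fp + fn
    # Assumed convention: score 0 when neither mask has any foreground.
    return 2.0 * tp / denom if denom else 0.0

# One true positive, one false positive, one false negative.
print(f_score([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
```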

B. Experimental results 

Experimental results. We compare the proposed method 
(STDM) with six state-of-the-art online background subtrac¬ 
tion algorithms including Gaussian Mixture Model (GMM) ||T| 
as baseline, improved GMM (810 online auto-regression 
model (20l, non-parametric model with scale-invariant local 
patterns ca, discriminative model using generalized Struct 
1-SVM (190 and the Bayesian joint domain-range (JDR) 
model inT0 ~^ In the comparisons, for the methods m, m, 
CD, Ell we use their released codes, and implement the 
methods (Hi, (201 by ourselves. The F-scores (%) over all 
10 videos are reported in Table I, where the last two columns report results of our method using either RGB or CS-STLTP as the feature. Note that for the result using the RGB feature we represent each video brick by concatenating the RGB values of all its pixels. We also exhibit the results and comparisons using the precision-recall (PR) curves, as shown in Fig. 4. Due to space limitations, we only show results on 5 videos. From the results, we can observe that the proposed method outperforms the other methods on most videos. For

^Available at http://dparks.wikidot.com/background-subtraction 

^Available at http://www.cs.mun.ca/~gong/Pages/Research.html

^Available at http://www.cs.cmu.edu/~yaser/ 



[Figure: precision-recall curves; horizontal axis: Recall]

Fig. 4. Experimental results generated by our approach and competing methods on 5 videos: first row left, the scene including a dynamic curtain and indistinctive foreground objects (i.e. having similar appearance with backgrounds); first row right, the scene with heavy rain; second row left, an indoor scene with sudden lighting changes; second row right, the scene with dynamic water surface; third row, a busy airport. The precision-recall (PR) curve is introduced as the benchmark measurement for all the 6 algorithms.


the scenes with highly dynamic backgrounds (e.g., the #2, #5, and #10 scenes), the improvements made by our method are more than 10%. The system also handles indistinctive foreground objects well (i.e. small objects or background-like objects in the #1 and #3 scenes). Moreover, we make significant improvements (i.e. 15% to 25%) in scenes #6 and #7, which include both sudden and gradual lighting changes. A number of sampled results of background subtraction are exhibited in Fig. 5.

The benefit of using the proposed CS-STLTP feature is clearly validated by the results shown in Table I and Fig. 5. In general, our approach simply using RGB values can achieve satisfying performance for common scenes, e.g., with fair appearance and motion changes, while the CS-STLTP operator can better handle highly dynamic variations (e.g. sudden illumination changes, rippling water). In addition, we also compare CS-STLTP with the existing scale-invariant descriptor SILTP proposed in [3]. We keep all settings in our approach except for replacing the feature with SILTP, and achieve an average precision over all 10 videos of 69.70%. This result shows that CS-STLTP is very suitable and effective for the video brick representation.
Fig. 5. Sampled results of background subtraction generated by our approach (using RGB or CS-STLTP as the feature and RIPCA as the update strategy) and other competing methods.
TABLE I

Quantitative results and comparisons on the 10 complex videos using the F-score (%) measurement. The last two columns report the results of our method using either RGB or CS-STLTP as the feature.

Scene              GMM    Im-GMM  Online-AR [20]  JDR    SVM    PKDE   STDM(RGB)  STDM(Ftr.)
1# Airport         46.99  47.36   62.72           60.23  65.35  68.14  70.52      66.40
2# Floating Bottle 57.91  57.77   43.79           45.64  47.87  59.57  69.04      78.17
3# Waving Curtain  62.75  74.58   77.86           72.72  77.34  78.01  79.74      74.93
4# Active Fountain 52.77  60.11   70.41           68.53  74.94  76.33  76.85      85.46
5# Heavy Rain      71.11  81.54   78.68           75.88  82.62  76.71  79.35      75.29
6# Sudden Light    47.11  51.37   37.30           52.26  47.61  52.63  51.56      74.57
7# Gradual Light   51.10  50.12   13.16           47.48  62.44  54.86  54.84      77.41
8# Train Station   65.12  68.80   36.01           57.68  61.79  67.05  73.43      66.35
9# Swaying Trees   19.51  23.25   63.54           45.61  24.38  42.54  43.71      75.89
10# Water Surface  79.54  86.01   77.31           84.27  83.13  74.30  88.54      88.68
Average            55.39  59.56   57.02           60.23  59.79  63.08  68.75      76.31




Fig. 6. Discussion of parameter selection: (i) learning rate a for model 
maintenance (in (a)) and (ii) the contrast threshold of CS-STLTP feature r 
(in (b)). In each figure, the horizontal axis represents the different parameter 
values; the three lines in different colors denote, respectively, the false positive 
(FP), false negative (FN), and the sum of FP and FN. 



Fig. 7. Empirical study for the size of video brick in our approach. We carry out the experiments on the 10 videos with different brick sizes while keeping the other settings fixed. The vertical axis represents the average precision of background subtraction and the horizontal axis represents the different sizes of video bricks with respect to background decomposition.


C. Discussion

Furthermore, we conduct the following empirical studies 
to justify the parameter determinations and settings of our 
approach. 

Efficiency. Like other online-learning background models, there is a trade-off between model stability and maintenance efficiency. The corresponding parameter in our method is the learning rate a. We tune a in the range of 0 to 0.3 while fixing the other model parameters and visualize the quantitative results of background subtraction, as shown in Fig. 6(a). From the results, we can observe that our model is insensitive to this parameter in the range 0 to 0.1. In practice, when the scene is extremely busy and crowded, a can be set to a relatively small value to keep the model stable.
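The stability trade-off controlled by the learning rate can be illustrated with a generic exponentially weighted covariance update. This is a sketch under assumptions rather than the paper's exact rule: the blending form, the zero-mean observations, and the function names are all hypothetical.

```python
import numpy as np

def update_covariance(cov, v, a=0.05):
    """Blend the running covariance with the outer product of a new observation.

    a = 0 keeps the old model unchanged; larger a adapts faster but lets
    a single disturbed observation move the subspace further.
    """
    v = np.asarray(v, dtype=float)
    return (1.0 - a) * cov + a * np.outer(v, v)

def leading_basis(cov, d):
    """Return the d eigenvectors of cov with the largest eigenvalues, as columns."""
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :d]

cov = np.eye(3)
v = np.array([0.0, 0.0, 4.0])                # one strongly deviating observation
for a in (0.0, 0.05, 0.3):
    print(a, np.diag(update_covariance(cov, v, a)))
```

With a small rate the deviating observation barely changes the covariance, which matches the advice to lower a in busy scenes where foreground objects would otherwise leak into the background model.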

Feature effectiveness. The contrast threshold r is the only parameter of the CS-STLTP operator, and it affects the power of the feature to characterize spatio-temporal information within video bricks. From the empirical results of parameter tuning, as shown in Fig. 6(b), we can observe that the appropriate range for r is 0.15 to 0.25. In practice, the model could become sensitive to noise with a very small value of r (say r < 0.15), and too large a value (say r > 0.25) might reduce the accuracy of detecting foreground regions with homogeneous appearances.
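The effect of the contrast threshold can be seen in a scale-invariant ternary comparison of two intensities. The sketch below follows the general SILTP-style scheme and is an assumption: the exact CS-STLTP operator is defined earlier in the paper and may differ in detail, and the function names are hypothetical.

```python
def ternary_code(p, q, r=0.2):
    """Compare intensity p against q with a tolerance proportional to q.

    Returns 0b01 if p is clearly brighter, 0b10 if clearly darker,
    and 0b00 if the two lie within the tolerance band.
    """
    if p > (1.0 + r) * q:
        return 0b01
    if p < (1.0 - r) * q:
        return 0b10
    return 0b00

def encode_pairs(pairs, r=0.2):
    """Concatenate the 2-bit codes of several (p, q) pairs into one integer."""
    code = 0
    for p, q in pairs:
        code = (code << 2) | ternary_code(p, q, r)
    return code

# A small intensity difference is ignored at r = 0.2 but coded at r = 0.05.
print(ternary_code(108, 100, r=0.2), ternary_code(108, 100, r=0.05))
```

Because the tolerance scales with q, multiplying both intensities by the same factor leaves the code unchanged, which is what makes this kind of pattern robust to illumination scaling; a very small r turns ordinary noise into non-zero codes.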

Size of video brick. One may be interested in how the system performance is affected by the size of the video brick for background decomposition, so we present an empirical study on different sizes of video bricks in Fig. 7. We observe that the best result is achieved with a brick size of 4 × 4 × 3, and the results with the sizes of 4 × 4 × 1 and 4 × 4 × 5 are also satisfactory. With very small bricks (e.g. 1 × 1 × 3), few spatio-temporal statistics are captured and the models may have problems handling scene variations. Bricks of large sizes (e.g. 8 × 8 × 5) carry too much information, and their subspaces cannot be effectively generated by the linear ARMA model. The experimental results are also in accordance with our motivations in Section I. In practice, we can flexibly set the size according to the resolution of the surveillance videos.

Model initialization. Our method is not sensitive to the number of observed frames in the initial stage of subspace generation. We test different numbers, say 30, 40, and 60, on two typical surveillance scenes, i.e. the Airport Hall (scene #1) and the Train Station (scene #8). The F-score outputs show that the deviations with different numbers of initial frames are very small, e.g. less than 0.2. In general, we require the observed scenes to be relatively clean for initialization, although a few objects that move across are allowed.

V. Conclusion 

This paper studies an effective method for background subtraction, addressing the challenges in real surveillance scenarios. In the method, we learn and maintain the dynamic texture models within spatio-temporal video patches (i.e. video























IEEE TRANSACTIONS ON IMAGE PROCESSING, 2014. 




bricks). Sufficient experiments as well as empirical analysis are presented to validate the advantages of our method.

In the future, we plan to improve the method in two aspects. (1) Some efficient tracking algorithms can be incorporated into the framework to better distinguish the foreground objects. (2) A GPU-based implementation can be developed to process each part of the scene in parallel, which would probably improve the system efficiency significantly.